Introduction

Row

Overview

For this project, we will follow the DCOVAC process. The process is listed below:

DCOVAC – THE DATA MODELING FRAMEWORK

  • DEFINE the Problem
  • COLLECT the Data from Appropriate Sources
  • ORGANIZE the Data Collected
  • VISUALIZE the Data by Developing Charts
  • ANALYZE the data with Appropriate Statistical Methods
  • COMMUNICATE your Results

Row

The Problem & Data Collection

The Problem

The goal of the analysis is to determine which variables best predict whether a customer will “churn”. Churn means the customer stopped using the company’s product or service. The dataset is for a telecommunications company called Telco, so in the context of the data, churn means that the customer terminated their subscription. With this analysis, Telco will have a better idea of what factors contribute to its clients leaving and can therefore develop a plan to retain them.

The Data

This dataset has 7,043 observations and 21 variables. For this analysis, we will ignore the CustomerID variable, which serves as a unique identifier for each customer and does not provide any meaningful information for predicting customer churn.

Data Sources

  • Author(s): Steven Macko
  • Title: Telco Customer Churn
  • Year: 2018
  • Version: 1
  • Publisher: IBM/Kaggle
  • URL: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

The Data

VARIABLES TO PREDICT WITH

  • Gender: Customer’s gender (Male/Female)
  • SeniorCitizen: Whether the customer is a senior citizen (Yes(0)/No(1))
  • Partner: Whether the customer has a partner (Yes(0)/No(1))
  • Dependents: Whether the customer has dependents (Yes(0)/No(1))
  • PhoneService: Whether the customer has phone service (Yes(0)/No(1))
  • MultipleLines: Whether the customer has multiple phone lines (Yes(0)/No(1)/No phone service(2))
  • InternetService: If Customer Has Internet and the Type (DSL/Fiber optic/No(1))
  • OnlineSecurity: If Customer Has Security on Their Internet (Yes(0)/No(1)/No Internet(2))
  • OnlineBackup: If Customer Has a Backup for Their Internet (Yes(0)/No(1)/No Internet(2))
  • DeviceProtection: If Customer has Device Protections Services for Their Internet (Yes(0)/No(1)/No Internet(2))
  • TechSupport: Whether Customer has Tech Support Enabled on Their Subscription (Yes(0)/No(1)/No Internet(2))
  • StreamingTV: If the Customer Has StreamingTV Support on Their Subscription (Yes(0)/No(1)/No Internet(2))
  • StreamingMovies: If the Customer Has Movie Streaming Support on Their Subscription (Yes(0)/No(1)/No Internet(2))
  • Contract: Type of contract (Month-to-month/One year/Two years)
  • PaperlessBilling: If the Customer Has Opted for Paperless Billing (Yes(0)/No(1))
  • PaymentMethod: Payment method (Electronic check/Mailed check/Bank transfer/Credit card)
  • MonthlyCharges: Monthly charges for the customer (continuous variable)
  • TotalCharges: Total charges accumulated by the customer (continuous variable)

VARIABLES WE WANT TO PREDICT

  • Churn: Whether the customer churned (Yes(0)/No(1); binary response variable)
  • Tenure: Number of months with subscription (continuous response variable)

Data

Column

Organize the Data

Organizing data can also include summarizing data values in simple one-way and two-way tables.

  customerID           gender          SeniorCitizen       Partner     
 Length:7043        Length:7043        Min.   :0.0000   Min.   :0.000  
 Class :character   Class :character   1st Qu.:0.0000   1st Qu.:0.000  
 Mode  :character   Mode  :character   Median :0.0000   Median :1.000  
                                       Mean   :0.1621   Mean   :0.517  
                                       3rd Qu.:0.0000   3rd Qu.:1.000  
                                       Max.   :1.0000   Max.   :1.000  
                                                                       
   Dependents         tenure       PhoneService     MultipleLines  
 Min.   :0.0000   Min.   : 0.00   Min.   :0.00000   Min.   :0.000  
 1st Qu.:0.0000   1st Qu.: 9.00   1st Qu.:0.00000   1st Qu.:0.000  
 Median :1.0000   Median :29.00   Median :0.00000   Median :1.000  
 Mean   :0.7004   Mean   :32.37   Mean   :0.09683   Mean   :0.675  
 3rd Qu.:1.0000   3rd Qu.:55.00   3rd Qu.:0.00000   3rd Qu.:1.000  
 Max.   :1.0000   Max.   :72.00   Max.   :1.00000   Max.   :2.000  
                                                                   
 InternetService    OnlineSecurity  OnlineBackup    DeviceProtection
 Length:7043        Min.   :0.00   Min.   :0.0000   Min.   :0.0000  
 Class :character   1st Qu.:0.00   1st Qu.:0.0000   1st Qu.:0.0000  
 Mode  :character   Median :1.00   Median :1.0000   Median :1.0000  
                    Mean   :0.93   Mean   :0.8718   Mean   :0.8728  
                    3rd Qu.:1.00   3rd Qu.:1.0000   3rd Qu.:1.0000  
                    Max.   :2.00   Max.   :2.0000   Max.   :2.0000  
                                                                    
  TechSupport      StreamingTV     StreamingMovies    Contract        
 Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Length:7043       
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   Class :character  
 Median :1.0000   Median :1.0000   Median :1.0000   Mode  :character  
 Mean   :0.9265   Mean   :0.8323   Mean   :0.8288                     
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000                     
 Max.   :2.0000   Max.   :2.0000   Max.   :2.0000                     
                                                                      
 PaperlessBilling PaymentMethod      MonthlyCharges    TotalCharges   
 Min.   :0.0000   Length:7043        Min.   : 18.25   Min.   :  18.8  
 1st Qu.:0.0000   Class :character   1st Qu.: 35.50   1st Qu.: 401.4  
 Median :0.0000   Mode  :character   Median : 70.35   Median :1397.5  
 Mean   :0.4078                      Mean   : 64.76   Mean   :2283.3  
 3rd Qu.:1.0000                      3rd Qu.: 89.85   3rd Qu.:3794.7  
 Max.   :1.0000                      Max.   :118.75   Max.   :8684.8  
                                                      NA's   :11      
     Churn       
 Min.   :0.0000  
 1st Qu.:0.0000  
 Median :1.0000  
 Mean   :0.7346  
 3rd Qu.:1.0000  
 Max.   :1.0000  
                 

From this summary we can see that our variables take a variety of values and types. CustomerID is a unique identifier for each customer and thus serves no purpose in our analysis, so we will remove it. Our two dependent variables are Churn and Tenure. Churn is binary, with 0 meaning yes, the customer churned, and 1 meaning no, the customer did not churn. Tenure is continuous; the max tenure is 72 months and the median is 29 months of being subscribed to Telco’s service. SeniorCitizen, PhoneService, Partner, and Dependents were recoded from Yes/No to 0/1 for easier analysis. Similarly, MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies were recoded from Yes/No/No Internet to 0/1/2. Moreover, MonthlyCharges has a max of 118.75 and a median of 70.35, while TotalCharges has a max of 8684.8, a median of 1397.5, and 11 missing values.

Column

Transform Variables

My RStudio was having problems running the code with the variables as factors. Therefore, I recoded my Excel file to replace the text values of these variables with numbers and set them as numeric.

Customer Churn(Yes(0)/No(1)) & InternetService(DSL,Fiber Optic, No Internet(1))

# A tibble: 2 × 2
  Churn     n
  <chr> <int>
1 0      1869
2 1      5174
# A tibble: 3 × 2
  InternetService     n
  <chr>           <int>
1 1                1526
2 DSL              2421
3 Fiber optic      3096

Customer Churn (Yes or No)

Data Viz #1

Column

Response Variables

Churn Yes(0)/No(1)


We can see that about 73.5% of customers have not churned and 26.5% have churned. Looking at the potential predictors related to Customer Churn, we see the strongest relationships with Tenure, MonthlyCharges, and TotalCharges. The rest of the variable comparisons are further down.

Column

Transform Variables

Data Viz #2

Column

Response Variables

Tenure & Churn

We see the largest concentrations of values at the start and end of the histogram, at roughly 0-20 months and 60-72 months. Looking at the potential predictors related to Tenure, the strongest relationship is with TotalCharges. MonthlyCharges and TotalCharges are the only two other continuous variables; TotalCharges has a relatively high correlation with Tenure (.826) while MonthlyCharges has a smaller one (.248). The data also appears to be right skewed due to the concentration on the left of the histogram. The large number of values at the upper end is likely due to truncation of the tenure variable at 72 months, or perhaps to unaccounted-for noise. The large number near 0 is likely due to customers who have been subscribed for only a short time.

The Churn variable is binary and thus cannot be made into a histogram. Based on the Churn bar chart, far more customers have not churned than have churned.

Column

Transform Variables

Churn Analysis

Row

Predict Customer Churn (Yes(0)/No(1))

For this analysis we will use a Linear Regression Model.

Adjusted R-Squared

28 %

RMSE

0.37

Row

Regression Output

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| ContractOne year | 0.106 | 0.014 | 7.549 | 0.000 |
| TotalCharges | 0.000 | 0.000 | 6.852 | 0.000 |
| InternetServiceFiber optic | -0.354 | 0.058 | -6.111 | 0.000 |
| PaymentMethodElectronic check | -0.068 | 0.013 | -5.086 | 0.000 |
| PaperlessBilling | 0.045 | 0.010 | 4.495 | 0.000 |
| ContractTwo year | 0.070 | 0.017 | 4.110 | 0.000 |
| tenure | 0.002 | 0.001 | 3.917 | 0.000 |
| SeniorCitizen | -0.044 | 0.013 | -3.419 | 0.001 |
| MultipleLines | 0.059 | 0.024 | 2.403 | 0.016 |
| InternetServiceDSL | -0.143 | 0.076 | -1.895 | 0.058 |
| Dependents | -0.020 | 0.011 | -1.766 | 0.078 |
| TechSupport | -0.044 | 0.025 | -1.754 | 0.079 |
| OnlineSecurity | -0.043 | 0.025 | -1.710 | 0.087 |
| StreamingMovies | 0.066 | 0.045 | 1.460 | 0.144 |
| StreamingTV | 0.064 | 0.045 | 1.416 | 0.157 |
| (Intercept) | 0.616 | 0.461 | 1.337 | 0.181 |
| PhoneService | -0.064 | 0.069 | -0.924 | 0.355 |
| PaymentMethodMailed check | 0.007 | 0.015 | 0.465 | 0.642 |
| OnlineBackup | -0.011 | 0.024 | -0.462 | 0.644 |
| PaymentMethodCredit card (automatic) | 0.006 | 0.014 | 0.448 | 0.654 |
| genderMale | 0.003 | 0.009 | 0.375 | 0.707 |
| MonthlyCharges | 0.001 | 0.004 | 0.303 | 0.762 |
| DeviceProtection | 0.005 | 0.025 | 0.185 | 0.853 |
| Partner | -0.001 | 0.011 | -0.079 | 0.937 |

Residual Assumptions Explorations

Row

Analysis Summary

After examining this model, we determine that there are some predictors that are not important in predicting customer churn, so a pruned version of the model is created by removing predictors that are not significant.

Row

Predict Customer Churn Final Version

For this analysis we will use a pruned Linear Regression Model. The variables we removed are DeviceProtection (if the customer has device protection), OnlineSecurity (if the customer has online security), StreamingMovies (if they have movie streaming support), StreamingTV (if they have streaming TV support), PhoneService (if the customer has phone service), Partner (if the customer has a partner), and Gender (Male or Female).

Adjusted R-Squared

27 %

RMSE

0.38

Row

Regression Output

| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|:--|--:|--:|--:|--:|
| (Intercept) | 1.006 | 0.051 | 19.555 | 0.000 |
| InternetServiceFiber optic | -0.310 | 0.039 | -8.042 | 0.000 |
| ContractOne year | 0.110 | 0.014 | 7.860 | 0.000 |
| TotalCharges | 0.000 | 0.000 | 6.890 | 0.000 |
| InternetServiceDSL | -0.190 | 0.029 | -6.438 | 0.000 |
| PaymentMethodElectronic check | -0.077 | 0.013 | -5.815 | 0.000 |
| PaperlessBilling | 0.054 | 0.010 | 5.362 | 0.000 |
| TechSupport | -0.066 | 0.013 | -5.173 | 0.000 |
| ContractTwo year | 0.074 | 0.017 | 4.355 | 0.000 |
| SeniorCitizen | -0.052 | 0.013 | -3.990 | 0.000 |
| MonthlyCharges | -0.002 | 0.001 | -3.825 | 0.000 |
| tenure | 0.002 | 0.000 | 3.556 | 0.000 |
| OnlineBackup | -0.032 | 0.012 | -2.706 | 0.007 |
| Dependents | -0.023 | 0.010 | -2.234 | 0.026 |
| PaymentMethodMailed check | 0.012 | 0.015 | 0.794 | 0.427 |
| PaymentMethodCredit card (automatic) | 0.007 | 0.014 | 0.529 | 0.597 |
| MultipleLines | 0.002 | 0.010 | 0.190 | 0.850 |

Residual Assumptions Explorations

Row

Analysis Summary

After examining this model, looking at the residual plots we can see that the data is not perfect and there are some problems. There are some high values at the right of the Q-Q plot that may be due to the truncated nature of the data. The deviations from the line represent departures from normality. The curved nature of the Q-Q plot may suggest that there is less variance than expected.

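For the Residuals vs Fitted plot, we can see two very distinct lines. This pattern is expected when a linear regression is fit to a binary response, because each residual equals either 0 minus the fitted value or 1 minus the fitted value. Ideally, the plot would show the residuals randomly scattered to indicate a consistent and unbiased fit, but we do not see that here, which suggests a linear model is not well suited to this response.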

Removing the predictors that did not help predict customer churn slightly worsened our fit statistics: adjusted R-squared decreased from 28% to 27%, and RMSE (root mean squared error) increased from 0.37 to 0.38.

From the following table, we can see the effect on Customer Churn by the predictor variables.

| Variable | Direction |
|:--|:--|
| ContractOne year | Decrease |
| TotalCharges | Increase |
| MonthlyCharges | Increase |
| MultipleLines | Increase |
| TechSupport | Decrease |
| PaymentMethodElectronic check | Increase |
| InternetServiceDSL | Decrease |
| PaperlessBilling | Increase |
| ContractTwo year | Decrease |
| tenure | Decrease |
| SeniorCitizen | Increase |
| OnlineBackup | Decrease |
| InternetServiceFiber optic | Decrease |
| Dependents | Decrease |
| PaymentMethodMailed check | Decrease |
| PaymentMethodCredit card (automatic) | Decrease |

Tenure Analysis

Row

Predict Customer Tenure

Conclusion 1

Summary

In conclusion, we can see that our predictors do help to predict a customer’s tenure, with an R-squared value of .87. The most significant predictors for tenure are TotalCharges, MonthlyCharges, Contract, and PaymentMethod.

From this analysis, we can see that as these variables increase they:
| Decrease_Tenure | Increase_Tenure |
|:--|:--|
| If the customer has Multiple Lines | Customers Total Charges |
| Whether the customer has churned (Churn) | Customer Monthly Charges |
| If they are a Senior Citizen (SeniorCitizen) | Length of Contract |
| N/A | Customers Payment Method |
| N/A | If the customer has a partner (Partner) |

Additional Churn Analysis 1

Row

Predict Churn - Logistic Regression

Conclusion 2

Summary

In conclusion, we can see that our predictors do help to predict whether a customer will churn, with Tenure, InternetService, and Contract being the most significant variables.

Combining the results of both types of predictor models and only reporting where agreement was found, we can see that as these variables increase they:
| Decrease_Prob_to_Churn | Increase_Prob_to_Churn |
|:--|:--|
| Customer Tenure | Customers Total Charges |
| Whether the customer has online security (OnlineSecurity) | Customers Payment Method |
| Whether the customer is a senior citizen (SeniorCitizen) | The length of customer contract (Contract) |
| Whether the customer has internet service (InternetService) | Whether the customer has paperless billing (PaperlessBilling) |
| Whether the customer has multiple phone lines (MultipleLines) | N/A |

Additional Churn Analysis 2

Row

Predict Churn - Decision Tree

---
title: "Belden INFO 3200 Project"
output: 
  flexdashboard::flex_dashboard:
    vertical_layout: scroll
    source_code: embed
---

```{r setup, include=FALSE, warning=FALSE}
#include=FALSE will not include r code in output
#warning=FALSE will remove any warnings from output
library(flexdashboard)
library(tidyverse)
library(GGally)
library(caret) #for logistic regression
library(broom) #for tidy() function
```

```{r load_data}
telcodf <- read_csv("Belden-Telco-Customer-Churnv4.csv")
```

Introduction {data-orientation=rows}
=======================================================================

Row {data-height=250}
-----------------------------------------------------------------------

### Overview 

For this project, we will follow the DCOVAC process. The process is listed below:

DCOVAC – THE DATA MODELING FRAMEWORK

* DEFINE the Problem
* COLLECT the Data from Appropriate Sources
* ORGANIZE the Data Collected
* VISUALIZE the Data by Developing Charts
* ANALYZE the data with Appropriate Statistical Methods
* COMMUNICATE your Results

Row {data-height=650}
-----------------------------------------------------------------------

### The Problem & Data Collection

#### The Problem
The goal of the analysis is to determine which variables best predict whether a customer will “churn”. Churn means the customer stopped using the company's product or service. The dataset is for a telecommunications company called Telco, so in the context of the data, churn means that the customer terminated their subscription. With this analysis, Telco will have a better idea of what factors contribute to its clients leaving and can therefore develop a plan to retain them.

#### The Data
This dataset has 7,043 observations and 21 variables. For this analysis, we will ignore the `CustomerID` variable, which serves as a unique identifier for each customer and does not provide any meaningful information for predicting customer churn.

#### Data Sources

* Author(s): Steven Macko
* Title: Telco Customer Churn
* Year: 2018
* Version: 1
* Publisher: IBM/Kaggle
* URL: https://www.kaggle.com/datasets/blastchar/telco-customer-churn


### The Data
VARIABLES TO PREDICT WITH

* **Gender**: Customer's gender (Male/Female)
* **SeniorCitizen**: Whether the customer is a senior citizen (Yes(0)/No(1))
* **Partner**: Whether the customer has a partner (Yes(0)/No(1))
* **Dependents**: Whether the customer has dependents (Yes(0)/No(1))
* **PhoneService**: Whether the customer has phone service (Yes(0)/No(1))
* **MultipleLines**: Whether the customer has multiple phone lines (Yes(0)/No(1)/No phone service(2))
* **InternetService**: If Customer Has Internet and the Type (DSL/Fiber optic/No(1))
* **OnlineSecurity**: If Customer Has Security on Their Internet (Yes(0)/No(1)/No Internet(2))
* **OnlineBackup**: If Customer Has a Backup for Their Internet (Yes(0)/No(1)/No Internet(2))
* **DeviceProtection**: If Customer has Device Protections Services for Their Internet (Yes(0)/No(1)/No Internet(2))
* **TechSupport**: Whether Customer has Tech Support Enabled on Their Subscription (Yes(0)/No(1)/No Internet(2))
* **StreamingTV**: If the Customer Has StreamingTV Support on Their Subscription (Yes(0)/No(1)/No Internet(2))
* **StreamingMovies**: If the Customer Has Movie Streaming Support on Their Subscription (Yes(0)/No(1)/No Internet(2))
* **Contract**: Type of contract (Month-to-month/One year/Two years)
* **PaperlessBilling**: If the Customer Has Opted for Paperless Billing (Yes(0)/No(1))
* **PaymentMethod**: Payment method (Electronic check/Mailed check/Bank transfer/Credit card)
* **MonthlyCharges**: Monthly charges for the customer (continuous variable)
* **TotalCharges**: Total charges accumulated by the customer (continuous variable)

 

VARIABLES WE WANT TO PREDICT

* *Churn*: Whether the customer churned (Yes(0)/No(1); binary response variable)
* *Tenure*: Number of months with subscription (continuous response variable)

Data
=======================================================================


Column {data-width=650}
-----------------------------------------------------------------------
### Organize the Data
Organizing data can also include summarizing data values in simple one-way and two-way tables.

```{r, cache=TRUE}
#the cache=TRUE can be removed. This will allow you to rerun your code without it having to run EVERYTHING from scratch every time. If the output seems to not reflect new updates, you can choose Knit, Clear Knitr cache to fix.

#Clean column names by replacing spaces and other invalid characters with periods
colnames(telcodf) <- make.names(colnames(telcodf))
#View data
summary(telcodf)
#remove customerID due to it being an identifier
telcodf <- select(telcodf, -customerID)
```
From this summary we can see that our variables take a variety of values and types. CustomerID is a unique identifier for each customer and thus serves no purpose in our analysis, so we will remove it. Our two dependent variables are Churn and Tenure. Churn is binary, with 0 meaning yes, the customer churned, and 1 meaning no, the customer did not churn. Tenure is continuous; the max tenure is 72 months and the median is 29 months of being subscribed to Telco's service. SeniorCitizen, PhoneService, Partner, and Dependents were recoded from Yes/No to 0/1 for easier analysis. Similarly, MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies were recoded from Yes/No/No Internet to 0/1/2. Moreover, MonthlyCharges has a max of 118.75 and a median of 70.35, while TotalCharges has a max of 8684.8, a median of 1397.5, and 11 missing values.
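
The summary output reports 11 missing values in TotalCharges. As a minimal sketch of how missing values could be counted per column, the example below uses a small made-up data frame rather than the Telco file (the `toy` data frame and its values are assumptions for illustration only).

```r
# Toy data frame standing in for telcodf (values are invented for illustration)
toy <- data.frame(
  tenure         = c(0, 12, 30, 72),
  MonthlyCharges = c(20.5, 55.0, 70.3, 110.1),
  TotalCharges   = c(NA, 660.0, 2109.0, 7927.2)
)

# Count missing values in every column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that lm() would drop under its default na.action (na.omit)
rows_dropped <- sum(!complete.cases(toy))
print(rows_dropped)
```

Checking this before modeling is useful because `lm()` silently leaves out incomplete rows by default, so the model may be fit on fewer observations than the data set contains.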


Column {data-width=350}
-----------------------------------------------------------------------
### Transform Variables
My RStudio was having problems running the code with the variables as factors. Therefore, I recoded my Excel file to replace the text values of these variables with numbers and set them as numeric.
```{r, cache=TRUE}
telcodf <- mutate(telcodf,
                  SeniorCitizen = as.numeric(SeniorCitizen),
                  Partner = as.numeric(Partner),
                  Dependents = as.numeric(Dependents),
                  MultipleLines = as.numeric(MultipleLines),
                  OnlineSecurity = as.numeric(OnlineSecurity),
                  OnlineBackup = as.numeric(OnlineBackup),
                  DeviceProtection = as.numeric(DeviceProtection),
                  TechSupport = as.numeric(TechSupport),
                  StreamingTV = as.numeric(StreamingTV),
                  StreamingMovies = as.numeric(StreamingMovies),
                  PaperlessBilling = as.numeric(PaperlessBilling),
                  Churn = as.numeric(Churn))

```
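
The same recoding could probably be done in R instead of editing the Excel file. The sketch below is one possible way, assuming the original text values and the coding used in this dashboard (Yes = 0, No = 1, no service = 2). The `toy` data frame and the helper `recode_yes_no()` are hypothetical names, not part of the project code.

```r
# Hypothetical helper: map "Yes" -> 0, "No" -> 1, anything else
# (e.g. "No internet service") -> 2
recode_yes_no <- function(x) {
  ifelse(x == "Yes", 0, ifelse(x == "No", 1, 2))
}

# Toy data with the original text coding (values invented for illustration)
toy <- data.frame(
  Partner        = c("Yes", "No", "No"),
  OnlineSecurity = c("No", "Yes", "No internet service"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column, keeping the data frame structure
toy[] <- lapply(toy, recode_yes_no)
str(toy)
```

Keeping the recoding in code leaves the raw file untouched and documents the mapping alongside the analysis.
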
#### Customer Churn(Yes(0)/No(1)) & InternetService(DSL,Fiber Optic, No Internet(1))
```{r cache=TRUE}
as_tibble(select(telcodf,Churn) %>%
  table())
as_tibble(select(telcodf,InternetService) %>%
  table())

```
#### Customer Churn (Yes or No)

<!--Instructions to import .jpg or .png images
use getwd() to see current path structure 
copy file into same place as .Rmd file
put the path to this file in the link
format: ![Alt text](book.jpg) -->

![](ChurnDistributionJMP.jpg)


Data Viz #1
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables
#### Churn Yes(0)/No(1)
```{r, cache=TRUE}
# Count customers for each Churn value, then plot the counts as bars
count(telcodf, Churn) %>%
  ggplot(aes(x = factor(Churn), y = n)) +
  geom_bar(stat = "identity") +
  labs(x = "Churn")
```

We can see that about 73.5% of customers have not churned and 26.5% have churned. Looking at the potential predictors related to Customer Churn, we see the strongest relationships with Tenure, MonthlyCharges, and TotalCharges. The rest of the variable comparisons are further down.
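
The percentages can be reproduced from the Churn counts shown on the Data page (1,869 coded 0 and 5,174 coded 1). A minimal sketch, with the vector name `churn_counts` chosen here for illustration:

```r
# Counts taken from the Churn table on the Data page
# (0 = churned, 1 = did not churn)
churn_counts <- c(churned = 1869, not_churned = 5174)

# Convert counts to percentages of all customers, rounded to one decimal
churn_pct <- round(100 * prop.table(churn_counts), 1)
print(churn_pct)
```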


Column {data-width=500}
-----------------------------------------------------------------------

### Transform Variables



```{r, cache=TRUE}
ggpairs(select(telcodf,Churn,tenure,MonthlyCharges,TotalCharges,Contract))
```

Data Viz #2
=======================================================================


Column {data-width=500}
-----------------------------------------------------------------------
### Response Variables

#### Tenure & Churn
```{r, cache=TRUE}
ggplot(telcodf, aes(tenure)) + geom_histogram(bins=60)

ggplot(telcodf, aes(x = Churn)) + geom_bar()

```

We see the largest concentrations of values at the start and end of the histogram, at roughly 0-20 months and 60-72 months. Looking at the potential predictors related to Tenure, the strongest relationship is with TotalCharges. MonthlyCharges and TotalCharges are the only two other continuous variables; TotalCharges has a relatively high correlation with Tenure (.826) while MonthlyCharges has a smaller one (.248). The data also appears to be right skewed due to the concentration on the left of the histogram. The large number of values at the upper end is likely due to truncation of the tenure variable at 72 months, or perhaps to unaccounted-for noise. The large number near 0 is likely due to customers who have been subscribed for only a short time.

The Churn variable is binary and thus cannot be made into a histogram. Based on the Churn bar chart, far more customers have not churned than have churned.

Column {data-width=500}
-----------------------------------------------------------------------

### Transform Variables

```{r, cache=TRUE}
ggpairs(select(telcodf,Churn,tenure,SeniorCitizen,Partner,Dependents,PhoneService))

ggpairs(select(telcodf,Churn,tenure,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection))

ggpairs(select(telcodf,Churn,tenure,TechSupport,StreamingTV,StreamingMovies,PaperlessBilling,PaymentMethod))


```

Churn Analysis {data-orientation=rows}
=======================================================================

Row
-----------------------------------------------------------------------

### Predict Customer Churn (Yes(0)/No(1))
For this analysis we will use a Linear Regression Model.
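
Because Churn is a 0/1 variable, a linear regression on it acts as a linear probability model, and its fitted values are not restricted to the range 0 to 1. The sketch below illustrates this on simulated data (not the Telco data) and compares it with a logistic regression; the variables `x`, `y`, and `sim` are made up for the example.

```r
set.seed(1)  # make the simulated data reproducible

# Simulated data (not the Telco data): x drives a 0/1 outcome
x <- seq(-4, 4, length.out = 200)
y <- rbinom(length(x), size = 1, prob = plogis(2 * x))
sim <- data.frame(x = x, y = y)

# Linear regression on a 0/1 response (a linear probability model)
fit_lm <- lm(y ~ x, data = sim)

# Logistic regression on the same data
fit_glm <- glm(y ~ x, data = sim, family = binomial)

# Linear fitted values are not constrained to [0, 1]; logistic ones are
print(range(fitted(fit_lm)))
print(range(fitted(fit_glm)))
```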

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
Churn_lm <- lm(Churn ~ . ,data = telcodf)
summary(Churn_lm)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(Churn_lm)
```

### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(Churn_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```

### RMSE

```{r, cache=TRUE}
Sig<-round(summary(Churn_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```

Row
-----------------------------------------------------------------------

### Regression Output

```{r,include=FALSE, cache=TRUE}
#knitr::kable(summary(Churn_lm)$coef, digits = 3) #pretty table output
summary(Churn_lm)$coef
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(Churn_lm))[,4])  
out <- coef(summary(Churn_lm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```

### Residual Assumptions Explorations

```{r, cache=TRUE}
plot(Churn_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```

Row
-----------------------------------------------------------------------

### Analysis Summary
After examining this model, we determine that there are some predictors that are not important in predicting customer churn, so a pruned version of the model is created by removing predictors that are not significant.
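
One way to check whether pruning costs much fit is to compare the full and reduced models directly. The sketch below uses the built-in `mtcars` data purely as a stand-in for the Telco data, and the choice of predictors to drop is arbitrary.

```r
# Built-in mtcars data used purely as a stand-in for the Telco data
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Drop two predictors, mirroring the "- variable" syntax used in a formula
reduced <- update(full, . ~ . - qsec - drat)

# Adjusted R-squared penalizes extra predictors, so the two are comparable
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(round(adj_r2, 3))

# F-test for whether the dropped predictors jointly add explanatory power
print(anova(reduced, full))
```

A large p-value in the F-test would suggest the dropped predictors were not contributing much as a group.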

Row
-----------------------------------------------------------------------

### Predict Customer Churn Final Version
For this analysis we will use a pruned Linear Regression Model. The variables we removed are DeviceProtection (if the customer has device protection), OnlineSecurity (if the customer has online security), StreamingMovies (if they have movie streaming support), StreamingTV (if they have streaming TV support), PhoneService (if the customer has phone service), Partner (if the customer has a partner), and Gender (Male or Female).

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
Churn_lm <- lm(Churn ~ . -DeviceProtection -OnlineSecurity -StreamingMovies -StreamingTV -PhoneService -Partner -gender,data = telcodf)
summary(Churn_lm)
```

```{r, include=FALSE, cache=TRUE}
#the include=FALSE hides the output - remove to see
tidy(Churn_lm)
```

### Adjusted R-Squared

```{r, cache=TRUE}
ARSq<-round(summary(Churn_lm)$adj.r.squared,2)
valueBox(paste(ARSq*100,'%'), icon = "fa-thumbs-up")
```

### RMSE

```{r, cache=TRUE}
Sig<-round(summary(Churn_lm)$sigma,2)
valueBox(Sig, icon = "fa-thumbs-up")
```


Row
-----------------------------------------------------------------------

### Regression Output

```{r, include=FALSE, cache=TRUE}
knitr::kable(summary(Churn_lm)$coef, digits = 3) #pretty table output
```

```{r, cache=TRUE}
# this version sorts the p-values (it is using an index to reorder the coefficients)
idx <- order(coef(summary(Churn_lm))[,4])  
out <- coef(summary(Churn_lm))[idx,] 
knitr::kable(out, digits = 3) #pretty table output
```

### Residual Assumptions Explorations

```{r, cache=TRUE}
plot(Churn_lm, which=c(1,2)) #which tells which plots to show (1-6 different plots)
```

Row
-----------------------------------------------------------------------

### Analysis Summary
After examining this model, looking at the residual plots we can see that the data is not perfect and there are some problems. There are some high values at the right of the Q-Q plot that may be due to the truncated nature of the data. The deviations from the line represent departures from normality. The curved nature of the Q-Q plot may suggest that there is less variance than expected.

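For the Residuals vs Fitted plot, we can see two very distinct lines. This pattern is expected when a linear regression is fit to a binary response, because each residual equals either 0 minus the fitted value or 1 minus the fitted value. Ideally, the plot would show the residuals randomly scattered to indicate a consistent and unbiased fit, but we do not see that here, which suggests a linear model is not well suited to this response.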
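
A minimal sketch of where two lines in a Residuals vs Fitted plot can come from, using simulated data rather than the Telco data (the variables `x` and `y` are made up for illustration). With a 0/1 response, each residual is the observed value minus the fitted value, so it can only fall on one of two lines.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated 0/1 response (not the Telco data)
x <- runif(300)
y <- rbinom(300, size = 1, prob = 0.2 + 0.6 * x)
fit <- lm(y ~ x)

res <- residuals(fit)
fit_vals <- fitted(fit)

# residual = y - fitted, so every point lies on one of two parallel lines:
# residual = -fitted (when y = 0) or residual = 1 - fitted (when y = 1)
on_lower_line <- isTRUE(all.equal(unname(res[y == 0]), unname(-fit_vals[y == 0])))
on_upper_line <- isTRUE(all.equal(unname(res[y == 1]), unname(1 - fit_vals[y == 1])))
print(c(on_lower_line, on_upper_line))
```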

Removing the predictors that did not help predict customer churn slightly worsened our fit statistics: adjusted R-squared decreased from 28% to 27%, and RMSE (root mean squared error) increased from 0.37 to 0.38.
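
The value shown in the RMSE boxes comes from `summary(Churn_lm)$sigma`, which R reports as the residual standard error. It divides by the residual degrees of freedom rather than the number of observations, so it is slightly larger than a plain RMSE. A small sketch of the difference, using the built-in `mtcars` data purely as a stand-in:

```r
# Built-in mtcars data used purely as a stand-in for the Telco data
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual standard error: sqrt(RSS / residual degrees of freedom)
sigma_val <- summary(fit)$sigma

# Root mean squared error: sqrt(RSS / number of observations)
rmse_val <- sqrt(mean(residuals(fit)^2))

print(c(sigma = sigma_val, rmse = rmse_val))
```

With thousands of observations and a couple dozen coefficients, as in this project, the two values should be nearly identical.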

From the following table, we can see the effect on Customer Churn by the predictor variables.

```{r, cache=TRUE}
#create table summary of predictor changes
predchang = data.frame(
  Variable = c('ContractOne year', 'TotalCharges', 'MonthlyCharges', 'MultipleLines', 'TechSupport','PaymentMethodElectronic check', 'InternetServiceDSL', 'PaperlessBilling','ContractTwo year', 'tenure', 'SeniorCitizen', 'OnlineBackup','InternetServiceFiber optic', 'Dependents', 'PaymentMethodMailed check','PaymentMethodCredit card (automatic)'),
  Direction = c('Decrease', 'Increase', 'Increase', 'Increase', 'Decrease','Increase', 'Decrease', 'Increase', 'Decrease','Decrease','Increase', 'Decrease', 'Decrease', 'Decrease', 'Decrease','Decrease')
)
knitr::kable(predchang) #pretty table output

```

Tenure Analysis {data-orientation=rows}
=======================================================================

Row {data-height=900}
-----------------------------------------------------------------------

### Predict Customer Tenure
![](Tenure-Prediction.jpg)


Conclusion 1
=======================================================================
### Summary

In conclusion, we can see that our predictors do help to predict a customer's tenure, with an R-squared value of .87. The most significant predictors for tenure are TotalCharges, MonthlyCharges, Contract, and PaymentMethod.

From this analysis, we can see that as these variables increase they:
```{r}
#final table summary of predictor changes
predtenurefnl = tibble(Decrease_Tenure = # tibble() replaces the deprecated data_frame()
                            c("If the customer has Multiple Lines",
                             "Whether the customer has churned (Churn)",
                            "If they are a Senior Citizen (SeniorCitizen)",
                            "N/A", "N/A"
                            ),
                    Increase_Tenure = c("Customers Total Charges",
                                          "Customer Monthly Charges",
                                        "Length of Contract",
                                        "Customers Payment Method",
                                        "If the customer has a partner (Partner)"
                                ))  
knitr::kable(predtenurefnl) #pretty table output
```

Additional Churn Analysis 1 {data-orientation=rows}
=======================================================================

Row {data-height=900}
-----------------------------------------------------------------------

### Predict Churn - Logistic Regression
![](Churn-NominalLogisticV2.jpg)

Conclusion 2
=======================================================================
### Summary

In conclusion, we can see that our predictors do help to predict whether a customer will churn, with Tenure, InternetService, and Contract being the most significant variables.

Combining the results of both types of predictor models and only reporting where agreement was found, we can see that as these variables increase they:
```{r}
#final table summary of predictor changes
predchurnfnl = tibble(Decrease_Prob_to_Churn = # tibble() replaces the deprecated data_frame()
                            c("Customer Tenure",
                            "Whether the customer has online security (OnlineSecurity)",
                            "Whether the customer is a senior citizen (SeniorCitizen)",
                            "Whether the customer has internet service (InternetService)",
                            "Whether the customer has multiple phone lines (MultipleLines)"
                            ),
                    Increase_Prob_to_Churn = c("Customers Total Charges",
                            "Customers Payment Method",
                            "The length of customer contract (Contract)",
                            "Whether the customer has paperless billing (PaperlessBilling)",
                            "N/A"
                            ))
knitr::kable(predchurnfnl) #pretty table output
```

Additional Churn Analysis 2 {data-orientation=rows}
=======================================================================

Row {data-height=900}
-----------------------------------------------------------------------

### Predict Churn - Decision Tree
![](Churn-DecisionTree.jpg)